{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0174eb96",
   "metadata": {},
   "source": [
    "# Bring your own LLMs\n",
    "\n",
    "Ragas uses langchain under the hood for connecting to LLMs for metrices that require them. This means you can swap out the default LLM we use (`gpt-3.5-turbo-16k`) to use any 100s of API supported out of the box with langchain.\n",
    "\n",
    "- [Completion LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.llms)\n",
    "- [Chat based LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.chat_models)\n",
    "\n",
    "This guide will show you how to use another or LLM API for evaluation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43b57fcd-5f3f-4dc5-9ba1-c3b152c501cc",
   "metadata": {},
   "source": [
    ":::{Note}\n",
    "If your looking to use Azure OpenAI for evaluation checkout [this guide](./azure-openai.ipynb)\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55f0f9b9",
   "metadata": {},
   "source": [
    "## Evaluating with GPT4\n",
    "\n",
    "Ragas uses gpt3.5 by default but using gpt4 for evaluation can improve the results so lets use that for the `Faithfulness` metric\n",
    "\n",
    "To start-off, we initialise the gpt4 `chat_model` from langchain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6d96660",
   "metadata": {},
   "outputs": [],
   "source": [
    "# make sure you have you OpenAI API key ready\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"your-openai-key\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6906a4d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "gpt4 = ChatOpenAI(model_name=\"gpt-4\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1fdb48b",
   "metadata": {},
   "source": [
    "In order to you the Langchain LLM you have to use the `RagasLLM` wrapper. This help the Ragas library specify the interfaces that will be used internally by the metrics and what is exposed via the Langchain library. You can also use other LLM APIs in tools like LlamaIndex and LiteLLM but creating your own implementation of `RagasLLM` that supports it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "25c72521-3372-4663-81e4-c159b0edde40",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.llms import LangchainLLM\n",
    "\n",
    "gpt4_wrapper = LangchainLLM(llm=gpt4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62645da8-6a52-4cbb-bec7-59f7e153cd38",
   "metadata": {},
   "source": [
    "Substitute the llm in `Metric` instance with the newly create GPT4 model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "307321ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.metrics import faithfulness\n",
    "\n",
    "faithfulness.llm = gpt4_wrapper"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1930dd49",
   "metadata": {},
   "source": [
    "That's it! faithfulness will now be using GPT-4 under the hood for evaluations.\n",
    "\n",
    "Now lets run the evaluations using the example from [quickstart](../quickstart.ipnb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "62c0eadb",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/ragas_eval/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f19d5352c37247b78228209a8fced52b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    baseline: Dataset({\n",
       "        features: ['question', 'ground_truths', 'answer', 'contexts'],\n",
       "        num_rows: 30\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# data\n",
    "from datasets import load_dataset\n",
    "\n",
    "fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")\n",
    "fiqa_eval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c4396f6e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "evaluating with [faithfulness]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████| 1/1 [06:03<00:00, 363.87s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'faithfulness': 0.7667}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# evaluate\n",
    "from ragas import evaluate\n",
    "\n",
    "result = evaluate(\n",
    "    fiqa_eval[\"baseline\"].select(range(5)),  # showing only 5 for demonstration\n",
    "    metrics=[faithfulness],\n",
    ")\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f490031e-fb73-4170-8762-61cadb4031e6",
   "metadata": {},
   "source": [
    "## Evaluating with Open-Source LLMs\n",
    "\n",
    "You can also use any of the Open-Source LLM for evaluating. Ragas support most the the deployment methods like [HuggingFace TGI](https://python.langchain.com/docs/integrations/llms/huggingface_textgen_inference), [Anyscale](https://python.langchain.com/docs/integrations/llms/anyscale), [vLLM](https://python.langchain.com/docs/integrations/llms/vllm) and many [more](https://python.langchain.com/docs/integrations/llms/) through Langchain. \n",
    "\n",
    "When it comes to selecting open-source language models, there are some rules of thumb to follow, given that the quality of evaluation metrics depends heavily on the model's quality:\n",
    "\n",
    "1. Opt for models with more than 7 billion parameters. This choice ensures a minimum level of quality in the results for ragas metrics. Models like Llama-2 or Mistral can be an excellent starting point.\n",
    "2. Always prioritize finetuned models over base models. Finetuned models tend to follow instructions more effectively, which can significantly improve their performance.\n",
    "3. If your project focuses on a specific domain, such as science or finance, prioritize models that have been pre-trained on a larger volume of tokens from your domain of interest. For instance, if you are working with research data, consider models pre-trained on a substantial number of tokens from platforms like arXiv or Semantic Scholar.\n",
    "\n",
    ":::{note}\n",
    "Choosing the right Open-Source LLM for evaluation can by tricky. You can also fine-tune these models to get even better performance on Ragas meterics. If you need some help/advice on that feel free to [talk to us](https://calendly.com/shahules/30min)\n",
    ":::\n",
    "\n",
    "In this example we are going to use [vLLM](https://github.com/vllm-project/vllm) for hosting a `HuggingFaceH4/zephyr-7b-alpha`. Checkout the [quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) for more details on how to get started with vLLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85e313f2-e45c-4551-ab20-4e526e098740",
   "metadata": {},
   "outputs": [],
   "source": [
    "# start the vLLM server\n",
    "!python -m vllm.entrypoints.openai.api_server \\\n",
    "    --model HuggingFaceH4/zephyr-7b-alpha \\\n",
    "    --host 0.0.0.0 \\\n",
    "    --port 8080"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9ddf74a-9830-4e1a-a4dd-7e5ec17a71e4",
   "metadata": {},
   "source": [
    "Now lets create an Langchain llm instance and wrap it with `RagasLLM` class. Because vLLM can run in OpenAI compatibilitiy mode, we can use the `ChatOpenAI` class as it is with small tweaks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2fd4adf3-db15-4c95-bf7c-407266517214",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from ragas.llms import LangchainLLM\n",
    "\n",
    "inference_server_url = \"http://localhost:8080/v1\"\n",
    "\n",
    "# create vLLM Langchain instance\n",
    "chat = ChatOpenAI(\n",
    "    model=\"HuggingFaceH4/zephyr-7b-alpha\",\n",
    "    openai_api_key=\"no-key\",\n",
    "    openai_api_base=inference_server_url,\n",
    "    max_tokens=5,\n",
    "    temperature=0,\n",
    ")\n",
    "\n",
    "# use the Ragas LangchainLLM wrapper to create a RagasLLM instance\n",
    "vllm = LangchainLLM(llm=chat)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dd7932a-7933-4de8-a6af-2830457e02a0",
   "metadata": {},
   "source": [
    "Now lets import all the metrics you want to use and change the llm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "20882d05-1b54-4d17-88a0-f7ada2d6a576",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.metrics import (\n",
    "    context_precision,\n",
    "    faithfulness,\n",
    "    context_recall,\n",
    ")\n",
    "from ragas.metrics.critique import harmfulness\n",
    "\n",
    "# change the LLM\n",
    "\n",
    "faithfulness.llm = vllm\n",
    "context_precision.llm = vllm\n",
    "context_recall.llm = vllm\n",
    "harmfulness.llm = vllm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58a610f2-19e5-40ec-bb7d-760c1d608a85",
   "metadata": {},
   "source": [
    "Now you can run the evaluations with and analyse the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d8858300-7985-4c79-8d03-c671afd645ac",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "evaluating with [faithfulness]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████| 1/1 [06:25<00:00, 385.74s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'faithfulness': 0.7167}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# evaluate\n",
    "from ragas import evaluate\n",
    "\n",
    "result = evaluate(\n",
    "    fiqa_eval[\"baseline\"].select(range(5)),  # showing only 5 for demonstration\n",
    "    metrics=[faithfulness],\n",
    ")\n",
    "\n",
    "result"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
